Assignment-1

Author

Emma Longo

library(magrittr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract()   masks magrittr::extract()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::lag()       masks stats::lag()
✖ purrr::set_names() masks magrittr::set_names()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(data.table)

Attaching package: 'data.table'

The following objects are masked from 'package:lubridate':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose
library(leaflet)
library(ggplot2)
library(dplyr)
library(tidyr)

Assignment 01 - Exploratory Data Analysis

Learning Goals

  • Download, read, and get familiar with an external dataset.

  • Step through the EDA “checklist” presented in class

  • Practice making exploratory plots

Assignment Description

We will work with air pollution data from the U.S. Environmental Protection Agency (EPA). The EPA has a national monitoring network of air pollution sites. The primary question you will answer is whether daily concentrations of PM2.5 (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).

A primer on particulate matter air pollution can be found here.

Your assignment should be completed in Quarto or R Markdown.

Steps

  1. Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table. For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.

2002: There are 15976 observations of 20 variables. The first row in the data set is dated 01/05/2002 (Livermore) and the last row is dated 12/31/2002 (Woodland-Gibson Road). Daily mean PM 2.5 concentration ranges from 0.00 to 104.30, with a mean of 16.12.

2022: There are 56140 observations of 20 variables. The first row in the data set is dated 01/01/2022 (Livermore) and the last row is dated 12/31/2022 (Woodland-Gibson Road). Daily mean PM 2.5 concentration ranges from -2.20 to 302.50 (a span of 304.7), with a mean of 8.52.

Looking at the key variable (PM 2.5 concentration), the 2022 minimum is -2.20, but negative values do not make sense for a concentration (the minimum should be 0). The 2002 data contain no negative values.

data_2002 <- fread("ad_viz_plotval_data.csv")
dim(data_2002)
[1] 15976    20
head(data_2002)
         Date Source  Site ID POC Daily Mean PM2.5 Concentration    UNITS
1: 01/05/2002    AQS 60010007   1                           25.1 ug/m3 LC
2: 01/06/2002    AQS 60010007   1                           31.6 ug/m3 LC
3: 01/08/2002    AQS 60010007   1                           21.4 ug/m3 LC
4: 01/11/2002    AQS 60010007   1                           25.9 ug/m3 LC
5: 01/14/2002    AQS 60010007   1                           34.5 ug/m3 LC
6: 01/17/2002    AQS 60010007   1                           41.0 ug/m3 LC
   DAILY_AQI_VALUE Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1:              78 Livermore               1              100
2:              92 Livermore               1              100
3:              71 Livermore               1              100
4:              80 Livermore               1              100
5:              98 Livermore               1              100
6:             115 Livermore               1              100
   AQS_PARAMETER_CODE       AQS_PARAMETER_DESC CBSA_CODE
1:              88101 PM2.5 - Local Conditions     41860
2:              88101 PM2.5 - Local Conditions     41860
3:              88101 PM2.5 - Local Conditions     41860
4:              88101 PM2.5 - Local Conditions     41860
5:              88101 PM2.5 - Local Conditions     41860
6:              88101 PM2.5 - Local Conditions     41860
                           CBSA_NAME STATE_CODE      STATE COUNTY_CODE  COUNTY
1: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
2: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
3: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
4: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
5: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
6: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
   SITE_LATITUDE SITE_LONGITUDE
1:      37.68753      -121.7842
2:      37.68753      -121.7842
3:      37.68753      -121.7842
4:      37.68753      -121.7842
5:      37.68753      -121.7842
6:      37.68753      -121.7842
tail(data_2002)
         Date Source  Site ID POC Daily Mean PM2.5 Concentration    UNITS
1: 12/10/2002    AQS 61131003   1                             15 ug/m3 LC
2: 12/13/2002    AQS 61131003   1                             15 ug/m3 LC
3: 12/22/2002    AQS 61131003   1                              1 ug/m3 LC
4: 12/25/2002    AQS 61131003   1                             23 ug/m3 LC
5: 12/28/2002    AQS 61131003   1                              5 ug/m3 LC
6: 12/31/2002    AQS 61131003   1                              6 ug/m3 LC
   DAILY_AQI_VALUE            Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1:              57 Woodland-Gibson Road               1              100
2:              57 Woodland-Gibson Road               1              100
3:               4 Woodland-Gibson Road               1              100
4:              74 Woodland-Gibson Road               1              100
5:              21 Woodland-Gibson Road               1              100
6:              25 Woodland-Gibson Road               1              100
   AQS_PARAMETER_CODE       AQS_PARAMETER_DESC CBSA_CODE
1:              88101 PM2.5 - Local Conditions     40900
2:              88101 PM2.5 - Local Conditions     40900
3:              88101 PM2.5 - Local Conditions     40900
4:              88101 PM2.5 - Local Conditions     40900
5:              88101 PM2.5 - Local Conditions     40900
6:              88101 PM2.5 - Local Conditions     40900
                                 CBSA_NAME STATE_CODE      STATE COUNTY_CODE
1: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
2: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
3: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
4: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
5: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
6: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
   COUNTY SITE_LATITUDE SITE_LONGITUDE
1:   Yolo      38.66121      -121.7327
2:   Yolo      38.66121      -121.7327
3:   Yolo      38.66121      -121.7327
4:   Yolo      38.66121      -121.7327
5:   Yolo      38.66121      -121.7327
6:   Yolo      38.66121      -121.7327
str(data_2002)
Classes 'data.table' and 'data.frame':  15976 obs. of  20 variables:
 $ Date                          : chr  "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Daily Mean PM2.5 Concentration: num  25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
 $ UNITS                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ DAILY_AQI_VALUE               : int  78 92 71 80 98 115 87 57 65 107 ...
 $ Site Name                     : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ DAILY_OBS_COUNT               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ PERCENT_COMPLETE              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS_PARAMETER_CODE            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS_PARAMETER_DESC            : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ CBSA_CODE                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA_NAME                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ STATE_CODE                    : int  6 6 6 6 6 6 6 6 6 6 ...
 $ STATE                         : chr  "California" "California" "California" "California" ...
 $ COUNTY_CODE                   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ COUNTY                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ SITE_LATITUDE                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ SITE_LONGITUDE                : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
summary(data_2002$`Daily Mean PM2.5 Concentration`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    7.00   12.00   16.12   20.50  104.30 
data_2022 <- fread("ad_viz_plotval_data (1).csv")
dim(data_2022)
[1] 56140    20
head(data_2022)
         Date Source  Site ID POC Daily Mean PM2.5 Concentration    UNITS
1: 01/01/2022    AQS 60010007   3                           12.7 ug/m3 LC
2: 01/02/2022    AQS 60010007   3                           13.9 ug/m3 LC
3: 01/03/2022    AQS 60010007   3                            7.1 ug/m3 LC
4: 01/04/2022    AQS 60010007   3                            3.7 ug/m3 LC
5: 01/05/2022    AQS 60010007   3                            4.2 ug/m3 LC
6: 01/06/2022    AQS 60010007   3                            3.8 ug/m3 LC
   DAILY_AQI_VALUE Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1:              52 Livermore               1              100
2:              55 Livermore               1              100
3:              30 Livermore               1              100
4:              15 Livermore               1              100
5:              18 Livermore               1              100
6:              16 Livermore               1              100
   AQS_PARAMETER_CODE       AQS_PARAMETER_DESC CBSA_CODE
1:              88101 PM2.5 - Local Conditions     41860
2:              88101 PM2.5 - Local Conditions     41860
3:              88101 PM2.5 - Local Conditions     41860
4:              88101 PM2.5 - Local Conditions     41860
5:              88101 PM2.5 - Local Conditions     41860
6:              88101 PM2.5 - Local Conditions     41860
                           CBSA_NAME STATE_CODE      STATE COUNTY_CODE  COUNTY
1: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
2: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
3: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
4: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
5: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
6: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
   SITE_LATITUDE SITE_LONGITUDE
1:      37.68753      -121.7842
2:      37.68753      -121.7842
3:      37.68753      -121.7842
4:      37.68753      -121.7842
5:      37.68753      -121.7842
6:      37.68753      -121.7842
tail(data_2022)
         Date Source  Site ID POC Daily Mean PM2.5 Concentration    UNITS
1: 12/01/2022    AQS 61131003   1                            3.4 ug/m3 LC
2: 12/07/2022    AQS 61131003   1                            3.8 ug/m3 LC
3: 12/13/2022    AQS 61131003   1                            6.0 ug/m3 LC
4: 12/19/2022    AQS 61131003   1                           34.8 ug/m3 LC
5: 12/25/2022    AQS 61131003   1                           23.2 ug/m3 LC
6: 12/31/2022    AQS 61131003   1                            1.0 ug/m3 LC
   DAILY_AQI_VALUE            Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
1:              14 Woodland-Gibson Road               1              100
2:              16 Woodland-Gibson Road               1              100
3:              25 Woodland-Gibson Road               1              100
4:              99 Woodland-Gibson Road               1              100
5:              74 Woodland-Gibson Road               1              100
6:               4 Woodland-Gibson Road               1              100
   AQS_PARAMETER_CODE       AQS_PARAMETER_DESC CBSA_CODE
1:              88101 PM2.5 - Local Conditions     40900
2:              88101 PM2.5 - Local Conditions     40900
3:              88101 PM2.5 - Local Conditions     40900
4:              88101 PM2.5 - Local Conditions     40900
5:              88101 PM2.5 - Local Conditions     40900
6:              88101 PM2.5 - Local Conditions     40900
                                 CBSA_NAME STATE_CODE      STATE COUNTY_CODE
1: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
2: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
3: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
4: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
5: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
6: Sacramento--Roseville--Arden-Arcade, CA          6 California         113
   COUNTY SITE_LATITUDE SITE_LONGITUDE
1:   Yolo      38.66121      -121.7327
2:   Yolo      38.66121      -121.7327
3:   Yolo      38.66121      -121.7327
4:   Yolo      38.66121      -121.7327
5:   Yolo      38.66121      -121.7327
6:   Yolo      38.66121      -121.7327
str(data_2022)
Classes 'data.table' and 'data.frame':  56140 obs. of  20 variables:
 $ Date                          : chr  "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Daily Mean PM2.5 Concentration: num  12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
 $ UNITS                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ DAILY_AQI_VALUE               : int  52 55 30 15 18 16 10 29 54 47 ...
 $ Site Name                     : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ DAILY_OBS_COUNT               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ PERCENT_COMPLETE              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS_PARAMETER_CODE            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS_PARAMETER_DESC            : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ CBSA_CODE                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA_NAME                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ STATE_CODE                    : int  6 6 6 6 6 6 6 6 6 6 ...
 $ STATE                         : chr  "California" "California" "California" "California" ...
 $ COUNTY_CODE                   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ COUNTY                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ SITE_LATITUDE                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ SITE_LONGITUDE                : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
summary(data_2022$`Daily Mean PM2.5 Concentration`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -2.20    4.20    6.90    8.52   10.80  302.50 
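The str() output above shows that Date is read as a character column, so the date range and the count of implausible readings are easier to check after parsing it. The following is a minimal base R sketch on a toy data frame, not the code used in this report. The first three rows reuse values from the head() and tail() output, the fourth row (with a negative reading) is hypothetical, and the column names `pm25` and `date_parsed` are assumptions made for illustration.

```r
# Toy stand-in for the 2022 EPA export. Rows 1-3 reuse values shown in the
# head()/tail() output; row 4 is a hypothetical negative reading.
toy <- data.frame(
  Date = c("01/01/2022", "01/02/2022", "12/31/2022", "07/04/2022"),
  pm25 = c(12.7, 13.9, 1.0, -0.4),
  stringsAsFactors = FALSE
)

# Date arrives as character (month/day/year), so parse it explicitly.
toy$date_parsed <- as.Date(toy$Date, format = "%m/%d/%Y")

date_span  <- range(toy$date_parsed)        # earliest and latest dates
n_bad_date <- sum(is.na(toy$date_parsed))   # rows whose date failed to parse
n_negative <- sum(toy$pm25 < 0)             # physically implausible readings

print(date_span)
print(c(bad_dates = n_bad_date, negative = n_negative))
```

Using range() on the parsed dates reports the true earliest and latest dates, whereas head() and tail() only show the first and last rows of a file that is ordered by site.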
  2. Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
merged_data <- rbindlist(list(
data_2002[, year := 2002],
data_2022[, year := 2022]))

merged_data$PM2.5 <- merged_data$`Daily Mean PM2.5 Concentration`
merged_data$`Daily Mean PM2.5 Concentration` <- NULL

merged_data$lat <- merged_data$SITE_LATITUDE
merged_data$SITE_LATITUDE <- NULL

merged_data$lon <- merged_data$SITE_LONGITUDE
merged_data$SITE_LONGITUDE <- NULL

str(merged_data)
Classes 'data.table' and 'data.frame':  72116 obs. of  21 variables:
 $ Date              : chr  "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
 $ Source            : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID           : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ UNITS             : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ DAILY_AQI_VALUE   : int  78 92 71 80 98 115 87 57 65 107 ...
 $ Site Name         : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ DAILY_OBS_COUNT   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ PERCENT_COMPLETE  : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS_PARAMETER_CODE: int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS_PARAMETER_DESC: chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ CBSA_CODE         : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA_NAME         : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ STATE_CODE        : int  6 6 6 6 6 6 6 6 6 6 ...
 $ STATE             : chr  "California" "California" "California" "California" ...
 $ COUNTY_CODE       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ COUNTY            : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ year              : num  2002 2002 2002 2002 2002 ...
 $ PM2.5             : num  25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
 $ lat               : num  37.7 37.7 37.7 37.7 37.7 ...
 $ lon               : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
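The step above asks for the year to be created from the Date variable, whereas the code assigns it as a constant per file. As a hedged alternative, the sketch below derives the year from the date string and renames the key columns in one pass. It uses base R on a two-row toy data frame; the new column names (`pm25`, `lat`, `lon`) are illustrative choices, and with a data.table the same rename could be done in place with `setnames()`.

```r
# Two-row toy data frame with the original EPA column names
# (values taken from the head()/tail() output above).
toy <- data.frame(
  Date = c("01/05/2002", "12/31/2022"),
  `Daily Mean PM2.5 Concentration` = c(25.1, 1.0),
  SITE_LATITUDE = c(37.68753, 38.66121),
  SITE_LONGITUDE = c(-121.7842, -121.7327),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Derive the year from the Date string instead of hard-coding it.
toy$year <- as.integer(format(as.Date(toy$Date, format = "%m/%d/%Y"), "%Y"))

# Rename via an old -> new lookup so each column is renamed, not copied.
new_names <- c(
  "Daily Mean PM2.5 Concentration" = "pm25",
  "SITE_LATITUDE" = "lat",
  "SITE_LONGITUDE" = "lon"
)
idx <- match(names(new_names), names(toy))
names(toy)[idx] <- unname(new_names)

str(toy)
```

Renaming in place keeps the columns in their original positions, while the copy-then-NULL pattern moves the renamed columns to the end of the table.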
  3. Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.

The monitoring sites are distributed throughout the entire state of California, with the highest density of sites in the San Francisco area and the Los Angeles / San Diego area. In addition, there is a much higher number of sites in 2022 as compared to 2002.

leaflet(merged_data) %>%
  addProviderTiles('CartoDB.Positron') %>%
  addCircleMarkers(
    lat = ~lat, 
    lng = ~lon, 
    color = ~ifelse(year == 2002, "red", "blue"),
    weight = 1, 
    opacity = 0.1,
    radius = 1) %>%
  addLegend(
    "topleft",
    colors = c("red", "blue"),
    labels = c(2002, 2022)) 
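The comparison of site counts between years can be checked numerically by reducing the daily records to one row per site and year. This is a minimal base R sketch on toy data, not the report's code. The column name `site_id` is an assumption, and the coordinates given for site 60010011 are hypothetical placeholders.

```r
# Toy daily records: several rows per site, as in the merged data.
# Coordinates for site 60010011 are hypothetical placeholders.
toy <- data.frame(
  site_id = c(60010007, 60010007, 61131003, 60010007, 60010011, 61131003),
  year    = c(2002, 2002, 2002, 2022, 2022, 2022),
  lat     = c(37.68753, 37.68753, 38.66121, 37.68753, 37.81, 38.66121),
  lon     = c(-121.7842, -121.7842, -121.7327, -121.7842, -122.28, -121.7327)
)

# Keep one row per site and year, then count sites in each year.
sites <- unique(toy[, c("site_id", "year", "lat", "lon")])
sites_per_year <- table(sites$year)

print(sites_per_year)
```

Passing a de-duplicated table like `sites` to leaflet() would also draw one marker per site instead of one per daily observation, which may make the map lighter to render.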
  4. Check for any missing or implausible values of PM2.5 in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.

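There are 0 missing values and 207 implausible values of PM 2.5 in the combined dataset (about 0.29% of the 72116 observations).

The implausible values are all negative (-2.2 to -0.1), which, as previously mentioned, does not make sense for a concentration variable. All of the implausible values are from the year 2022. Because Date is stored as a character column, summary() reports only its length and class, so this output does not show how the negative values are distributed within the year.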

# Missing values: 
sum(is.na(merged_data$PM2.5))
[1] 0
#Implausible values: 
sum(merged_data$PM2.5 < 0)
[1] 207
implausible_values <- merged_data %>% filter(PM2.5 < 0)

if (nrow(implausible_values) > 0) {
  summary(implausible_values)
} else {
  cat("No negative values in PM2.5\n")
}
     Date              Source             Site ID              POC      
 Length:207         Length:207         Min.   :60010011   Min.   :1.00  
 Class :character   Class :character   1st Qu.:60292009   1st Qu.:3.00  
 Mode  :character   Mode  :character   Median :60651016   Median :3.00  
                                       Mean   :60616431   Mean   :2.56  
                                       3rd Qu.:60832004   3rd Qu.:3.00  
                                       Max.   :61130004   Max.   :4.00  
                                                                        
    UNITS           DAILY_AQI_VALUE  Site Name         DAILY_OBS_COUNT
 Length:207         Min.   :0       Length:207         Min.   :1      
 Class :character   1st Qu.:0       Class :character   1st Qu.:1      
 Mode  :character   Median :0       Mode  :character   Median :1      
                    Mean   :0                          Mean   :1      
                    3rd Qu.:0                          3rd Qu.:1      
                    Max.   :0                          Max.   :1      
                                                                      
 PERCENT_COMPLETE AQS_PARAMETER_CODE AQS_PARAMETER_DESC   CBSA_CODE    
 Min.   :100      Min.   :88101      Length:207         Min.   :12540  
 1st Qu.:100      1st Qu.:88101      Class :character   1st Qu.:33045  
 Median :100      Median :88101      Mode  :character   Median :40900  
 Mean   :100      Mean   :88239                         Mean   :35740  
 3rd Qu.:100      3rd Qu.:88502                         3rd Qu.:42020  
 Max.   :100      Max.   :88502                         Max.   :47300  
                                                        NA's   :19     
  CBSA_NAME           STATE_CODE    STATE            COUNTY_CODE   
 Length:207         Min.   :6    Length:207         Min.   :  1.0  
 Class :character   1st Qu.:6    Class :character   1st Qu.: 29.0  
 Mode  :character   Median :6    Mode  :character   Median : 65.0  
                    Mean   :6                       Mean   : 61.5  
                    3rd Qu.:6                       3rd Qu.: 83.0  
                    Max.   :6                       Max.   :113.0  
                                                                   
    COUNTY               year          PM2.5              lat       
 Length:207         Min.   :2022   Min.   :-2.2000   Min.   :32.84  
 Class :character   1st Qu.:2022   1st Qu.:-0.7500   1st Qu.:34.84  
 Mode  :character   Median :2022   Median :-0.4000   Median :37.06  
                    Mean   :2022   Mean   :-0.5324   Mean   :36.95  
                    3rd Qu.:2022   3rd Qu.:-0.2000   3rd Qu.:38.61  
                    Max.   :2022   Max.   :-0.1000   Max.   :41.76  
                                                                    
      lon        
 Min.   :-124.2  
 1st Qu.:-122.1  
 Median :-121.2  
 Mean   :-120.5  
 3rd Qu.:-118.9  
 Max.   :-115.5  
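To look for a temporal pattern in the negative readings, one option is to parse the date, extract the month, and tabulate the count and share of negative values per month. The sketch below does this in base R on hypothetical toy values; the column names `pm25`, `month`, and `negative` are assumptions for illustration, and none of the numbers are results from the EPA data.

```r
# Hypothetical toy readings; none of these values come from the EPA file.
toy <- data.frame(
  Date = c("01/03/2022", "01/20/2022", "02/11/2022",
           "07/04/2022", "07/05/2022", "12/31/2022"),
  pm25 = c(-0.4, 7.1, -0.1, 5.0, -2.2, 1.0),
  stringsAsFactors = FALSE
)

# Parse the character date and pull out the month as an integer.
toy$month    <- as.integer(format(as.Date(toy$Date, format = "%m/%d/%Y"), "%m"))
toy$negative <- toy$pm25 < 0

prop_negative  <- mean(toy$negative)                     # overall share
neg_by_month   <- tapply(toy$negative, toy$month, sum)   # count per month
share_by_month <- tapply(toy$negative, toy$month, mean)  # share per month

print(prop_negative)
print(neg_by_month)
```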
                 
  5. Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.

State: At the state level, the average PM 2.5 concentration was higher in 2002 (16.12) than in 2022 (8.52). However, both the summary statistics and the boxplot show that the range in 2022 (-2.2 to 302.5) was much larger than in 2002 (0 to 104.3).

# Summary Statistics 

table(merged_data$STATE)

California 
     72116 
merged_data %>%
  group_by(year) %>%
  summarize(mean_PM2.5 = mean(PM2.5, na.rm = TRUE),
            min_PM2.5 = min(PM2.5, na.rm = TRUE),
            max_PM2.5 = max(PM2.5, na.rm = TRUE))
# A tibble: 2 × 4
   year mean_PM2.5 min_PM2.5 max_PM2.5
  <dbl>      <dbl>     <dbl>     <dbl>
1  2002      16.1        0        104.
2  2022       8.52      -2.2      302.
# Exploratory Plot 

ggplot(merged_data, aes(x = as.factor(year), y = PM2.5)) +
  geom_boxplot() +
  labs(title = "Daily Mean PM 2.5 Concentration by Year (State Level)",
       x = "Year",
       y = "Daily Mean PM 2.5 Concentration (ug/m3)")
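Because the minimum and maximum are driven by a few extreme days, it can help to compare years with statistics that are less sensitive to outliers. The sketch below computes the median, interquartile range, and maximum per year in base R. The toy values are borrowed from the quartiles and extremes printed by summary() earlier, used here only as sample points, so the resulting numbers are illustrative and not statistics of the full data.

```r
# Toy sample: four points per year, borrowed from the summary() output
# above (quartiles and extremes); not the full daily data.
toy <- data.frame(
  year = c(2002, 2002, 2002, 2002, 2022, 2022, 2022, 2022),
  pm25 = c(7, 12, 20.5, 104.3, -2.2, 4.2, 10.8, 302.5)
)

# One row of robust statistics per year.
stats <- do.call(rbind, lapply(split(toy$pm25, toy$year), function(x) {
  c(median = median(x), iqr = IQR(x), max = max(x))
}))

print(stats)
```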

County: At the county level, most of the 51 counties had higher average PM 2.5 concentrations in 2002 as compared to 2022. There were 5 counties that were exceptions to that trend: Del Norte (3.82 in 2002 and 4.97 in 2022), Mendocino (8.84 in 2002 and 10.13 in 2022), Mono (2.68 in 2002 and 4.69 in 2022), Siskiyou (2.69 in 2002 and 7.59 in 2022), and Trinity (2.78 in 2002 and 10.72 in 2022). The line plot shows that these counties with an upward trend from 2002 to 2022 had some of the lowest starting values in 2002.

# Summary Statistics 

merged_data %>%
  group_by(year, COUNTY) %>%
  summarize(mean_PM2.5 = mean(PM2.5, na.rm = TRUE)) %>%
  pivot_wider(names_from = year, values_from = mean_PM2.5)
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 51 × 3
   COUNTY       `2002` `2022`
   <chr>         <dbl>  <dbl>
 1 Alameda       14.3    8.20
 2 Butte         14.8    6.19
 3 Calaveras      9.9    6.04
 4 Colusa        11.7    7.61
 5 Contra Costa  15.1    8.25
 6 Del Norte      3.82   4.97
 7 El Dorado      4.91   4.07
 8 Fresno        19.9   10.2 
 9 Humboldt       7.79   6.76
10 Imperial      12.7    9.67
# ℹ 41 more rows
merged_data %>%
  group_by(year, COUNTY) %>%
  summarize(min_PM2.5 = min(PM2.5, na.rm = TRUE)) %>%
  pivot_wider(names_from = year, values_from = min_PM2.5)
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 51 × 3
   COUNTY       `2002` `2022`
   <chr>         <dbl>  <dbl>
 1 Alameda         1.9   -0.7
 2 Butte           1     -0.6
 3 Calaveras       2      0  
 4 Colusa          1      0.6
 5 Contra Costa    2      0.9
 6 Del Norte       0     -0.8
 7 El Dorado       0.1    0.2
 8 Fresno          0.1   -0.3
 9 Humboldt        1.9    0.9
10 Imperial        1.1   -0.4
# ℹ 41 more rows
merged_data %>%
  group_by(year, COUNTY) %>%
  summarize(max_PM2.5 = max(PM2.5, na.rm = TRUE)) %>%
  pivot_wider(names_from = year, values_from = max_PM2.5)
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 51 × 3
   COUNTY       `2002` `2022`
   <chr>         <dbl>  <dbl>
 1 Alameda        61.6   35.5
 2 Butte          88     42.8
 3 Calaveras      40     25.9
 4 Colusa         57     37  
 5 Contra Costa   76.7   37.3
 6 Del Norte      12.3   21.2
 7 El Dorado      27    105  
 8 Fresno         92.5   55.8
 9 Humboldt       23.7   21.2
10 Imperial       46.5   70  
# ℹ 41 more rows
# Exploratory Plot 

county_level <- merged_data %>%
  group_by(COUNTY, year) %>%
  summarize(county_mean_PM2.5 = mean(PM2.5, na.rm = TRUE))
`summarise()` has grouped output by 'COUNTY'. You can override using the
`.groups` argument.
ggplot(county_level, aes(x = year, y = county_mean_PM2.5, color = COUNTY)) +
  geom_line() +
  labs(title = "Average PM 2.5 Concentration at County Level",
       x = "Year",
       y = "Average PM 2.5 Concentration",
       color = "County")
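Counties that run against the overall trend can be identified programmatically by computing the 2022 minus 2002 difference in county means. This is a minimal base R sketch using three counties whose means appear in the table printed above; the column name `mean_pm25` is an assumption, and the full analysis would apply the same steps to all 51 counties.

```r
# County means copied from the printed county table (three counties only).
toy <- data.frame(
  COUNTY    = c("Alameda", "Alameda", "Del Norte", "Del Norte", "Fresno", "Fresno"),
  year      = c(2002, 2022, 2002, 2022, 2002, 2022),
  mean_pm25 = c(14.3, 8.20, 3.82, 4.97, 19.9, 10.2),
  stringsAsFactors = FALSE
)

# Reshape to a county x year matrix, then take the 2022 - 2002 difference.
wide   <- tapply(toy$mean_pm25, list(toy$COUNTY, toy$year), mean)
change <- wide[, "2022"] - wide[, "2002"]

# Counties whose mean concentration went up between the two years.
increased <- names(change)[change > 0]

print(sort(change))
print(increased)
```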

Site:

# Summary Statistics 

site_level <- merged_data %>%
  filter(`Site Name` == "Livermore")
site_level
           Date Source  Site ID POC    UNITS DAILY_AQI_VALUE Site Name
  1: 01/05/2002    AQS 60010007   1 ug/m3 LC              78 Livermore
  2: 01/06/2002    AQS 60010007   1 ug/m3 LC              92 Livermore
  3: 01/08/2002    AQS 60010007   1 ug/m3 LC              71 Livermore
  4: 01/11/2002    AQS 60010007   1 ug/m3 LC              80 Livermore
  5: 01/14/2002    AQS 60010007   1 ug/m3 LC              98 Livermore
 ---                                                                  
435: 12/27/2022    AQS 60010007   3 ug/m3 LC              15 Livermore
436: 12/28/2022    AQS 60010007   3 ug/m3 LC              30 Livermore
437: 12/29/2022    AQS 60010007   3 ug/m3 LC              20 Livermore
438: 12/30/2022    AQS 60010007   3 ug/m3 LC               3 Livermore
439: 12/31/2022    AQS 60010007   3 ug/m3 LC               6 Livermore
     DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
  1:               1              100              88101
  2:               1              100              88101
  3:               1              100              88101
  4:               1              100              88101
  5:               1              100              88101
 ---                                                    
435:               1              100              88101
436:               1              100              88101
437:               1              100              88101
438:               1              100              88101
439:               1              100              88101
           AQS_PARAMETER_DESC CBSA_CODE                         CBSA_NAME
  1: PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
  2: PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
  3: PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
  4: PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
  5: PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
 ---                                                                     
435: PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
436: PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
437: PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
438: PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
439: PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
     STATE_CODE      STATE COUNTY_CODE  COUNTY year PM2.5      lat       lon
  1:          6 California           1 Alameda 2002  25.1 37.68753 -121.7842
  2:          6 California           1 Alameda 2002  31.6 37.68753 -121.7842
  3:          6 California           1 Alameda 2002  21.4 37.68753 -121.7842
  4:          6 California           1 Alameda 2002  25.9 37.68753 -121.7842
  5:          6 California           1 Alameda 2002  34.5 37.68753 -121.7842
 ---                                                                        
435:          6 California           1 Alameda 2022   3.6 37.68753 -121.7842
436:          6 California           1 Alameda 2022   7.2 37.68753 -121.7842
437:          6 California           1 Alameda 2022   4.8 37.68753 -121.7842
438:          6 California           1 Alameda 2022   0.6 37.68753 -121.7842
439:          6 California           1 Alameda 2022   1.5 37.68753 -121.7842
# Summarize the Livermore site only, by year
site_level <- merged_data %>%
  filter(`Site Name` == "Livermore") %>%
  group_by(year) %>%
  summarize(site_mean_PM2.5 = mean(PM2.5, na.rm = TRUE),
            site_min_PM2.5 = min(PM2.5, na.rm = TRUE),
            site_max_PM2.5 = max(PM2.5, na.rm = TRUE))

# Exploratory Plot 

ggplot(site_level, aes(x = as.factor(year), y = site_mean_PM2.5)) +
  geom_col(fill = "red", color = "black") +
  labs(title = "Average PM 2.5 Concentration at Site Level",
       x = "Year",
       y = "Average PM 2.5 Concentration")